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ABSTRACT 

This paper addresses the issue of solving the time-dependent incompres- 
sible Navier-Stokes equations on the Connection Machine 2, for the problem 
of transition to turbulence of the steady flow in a channel. The spectral algo- 
rithm used serially requires 0(N 4 ) operations when solving the equations on 
an N x N x N grid; using the massive parallelism of the CM, it becomes an 
0(N 2 ) problem. Preliminary timings of the code, written in LISP, are 
included and compared with a corresponding code optimized for the Cray-2 
for a 128 x 128 x 101 grid. 

KEYWORDS: Parallel Algorithms, SIMD Architectures, Navier-Stokes equations. Spectral 
Methods 
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Fast solutions to the Navier-Stokes equations are of enduring interest in computational 
fluid dynamics. The level of resolution required to accurately simulate the relevant physics of 
flows involving transition to turbulence may span scales ranging eight orders or magnitude or 
more. Existing supercomputers still lack the capacity to resolve such problems. In this paper, 
we examine the possibility of employing massive parallelism towards meeting the ever- 
increasing computational demands of large-scale fluid dynamic simulations. 

The general framework is the time-dependent incompressible Navier-Stokes equations. 
One of the methods of choice for solving these equations when a high degree of accuracy is 
required is spectral methods, and we are in the process of examining the feasibility of imple- 
menting spectral methods on the Connection Machine 2. The particular tensor-product matrix 
formulation used here involves kernels which are amenable to implementation on this archi- 
tecture. These basic kernels are presented, along with a spectral algorithm that uses these ker- 
nels to solve the Navier-Stokes equations. 

MACHINE MODEL 

The Connection Machine 2 [6,2] is a parallel architecture with 65,000 processors. The 
machine can be segmented into quarters, and all the work on this paper was done on a 16K 
processor machine. Each processor has 64K bits of memory, or 2K of single precision floating 
point numbers. Each subset of 32 processors has a floating point co-processor available. 

The processors are SIMD (single instruction multiple data), meaning that all the proces- 
sors must do exactly the same thing at the same time. A serial processor, called the con- 
troller, gives the same instructions to the parallel array or processors. As part of the instruc- 
tion, scalar values can be broadcast. For instance, one could instruct all processors to add ‘5’ 
to location A in their local memories. 

The processors are interconnected in a topology that is hyper-cube like. For the purposes 
of this paper it is sufficient to consider the machine as arranged as a 128 by 128 toroidal grid 
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of processors, where each processor can communicate with its north, south, east, and west 
neighbors. A communication facility called the router can also be used for more generalized 
communication. 

NUMERICAL METHOD 

In this paper we solve the time-dependent incompressible Navier-Stokes equations: 


Tf t + Tf-W = -VP + vV 2 !? 
V-Tf = 0 


(la) 

(lb) 


The method employed for incremental time solution is a standard splitting method [1], 
With Tf representing the three-dimensional velocity vector, we advance the solution from time 
level V to level ’n+1’ via: 


Tf* + Tf -Vif = vV 2 Tf* 

which solves the momentum equations without the pressure terms, followed by: 

T? t n+1 = -VP n+1 
V-Tf +1 = 0 


( 2 ) 


(3a) 

(3b) 


which solves for the pressure and corrects the velocity to satisfy continuity. This second step 
is actually carried out in two stages: 


y2 p n+l _ y.-tf* 

Tf t n+1 = -VP n+1 


(4a) 

(4b) 


by utilizing a backward-Euler time discretization of eq. 4b. The time discretization employed 
in advancing eq. 2 is a mixed implicit-explicit scheme, with third-order low-storage Runge- 
Kutta for the nonlinear convection terms, and Crank-Nicholson for the viscous terms. 

Spatial discretization is carried out using a spectral collocation method; Fourier series are 
used in the streamwise and spanwise directions of the channel, with Chebyshev series used in 
the normal direction between the channel walls. The now-standard matrix method is used for 
evaluating derivative quantities at the collocation points [1]. That is, the approximate deriva- 
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tives of a function evaluated at the collocation points may be computed directly by a matrix- 
vector product; the coefficients of the basis function series need not be explicitly calculated. 
Formulae for the required Fourier and Chebyshev differentiation matrices are known in closed 
form [1], Although at first glance this appears to require N 6 operations to compute the 
derivative in one direction at each point of an N x N x N three-dimensional grid, the compu- 
tation can actually be carried out in N 4 operations, using the fact that the discretizations in 
each coordinate direction are separable. This is known as a tensor-product formulation. Simi- 
larly, if the overall discrete differential operator is linear and separable, its characteristic 
matrices may also be written in tensor product form, and thus the overall operator may be 
diagonalized in 0(N 4 ) operations [1], 

It may be noted that since Fourier discretizations are used in this application, the scheme 
may be formulated using Fast Fourier Transforms (FFT’s) rather than tensor products, with 
asymptotic work of 0(N 3 logN) rather than 0(N 4 ). We decided, however, to study a general 
implementation of spectral collocation solution of the Navier-Stokes equations, rather than 
limit the implementation to the restrictive periodic form. The additions to the time-split 
scheme to enforce general boundary conditions in all directions are minor [5]. Additionally, 
for the initial proof-of-concept phase of this project, approximate pressure boundary condi- 
tions are used. 

KERNEL OPERATIONS 

The application we report here is on a 16K CM, with its processors configured as a 128 
x 128 toroidally-cyclic grid; we denote processor number by the subscripts ’i,j\ The three- 
dimensional computational grid is mapped onto the processors in a straightforward manner, 
with the third direction, the ’z’ or ’normal’ direction, stored locally in each processor, and 
denoted by the subscript ’k’. Given the storage requirements of the scheme and the available 
local memory of each processor, the z-direction is discretized with 101 points, giving an 
overall mesh of about 1.65 million points. 
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The bulk of computation in this scheme lies in the tensor product operations. In indical 
notation, these operations in each of the three directions are defined: 

T x ( L , u ) = Lji u,j k (5a) 

1=1 

Ty( L y , u ) = £ Lj)f u iUc (5b) 

1=1 

T z (L z ,u)= 2 L^ujj, (5c) 

1=1 

The x-direction tensor product is simply a matrix multiply between L x and each of the 
k-planes of U. Likewise, the y-direction tensor product is U(L y ) T for each k-plane. These 
matrix multiplies are carried out in O(N) operations instead of 0(N 3 ), using the systolic algo- 
rithm [3]. This is because all N 2 processors can be used simultaneously. The algorithm 
works as follows: Before starting the multiply, the first matrix is skewed so that each row in 
the matrix is shifted over one position for each row down (i.e. the elements in row 3 are 
shifted 3 places to the right). Similarly, the columns of the second matrix are shifted down 
for each column over. The multiplication is then performed by shifting each row of the first 
matrix up one and each column of the second matrix over one, multiplying the elements and 
accumulating the result, then shifting again. This pattern is repeated N times to complete the 
product. The overall tensor product requires N separate matrix multiplies (one for each k- 
plane) so the product requires 0(N 2 ) operations. 

The summation in the third tensor product is independent of (i,j), so each processor 
decouples and performs a matrix-vector product. The operator L z is stored in the CM con- 
troller and broadcast element-by-element in the process of computing T Z (L Z , u ). 

ALGORITHM IMPLEMENTATION 

The algorithm consists of applications of the three tensor products defined above (eqs. 5), 
and direct element-by-element sums ( + ) and multiplies ( x ). The velocity variables are 
denoted u n , n= x to z, and P for the pressure. D n and D™ (n= x to z) are used to denote the 
spectral differentiation operators, first and second derivative respectively, in each of the three 
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begin (time step) 

e n = 0, n = x to z — initialize 

for m= 1 to 3 do: -- loop on the three steps of the Runge-Kutta 

for n= x to z do: — loop over components of 

— the momentum equation 

- Compute the RHS of the Crank-Nicholson step (eq. 2); the first three terms 

- are viscous terms, the next three terms are convection terms, and the last 

- is due to the Runge-Kutta scheme. 

f, = T* [d"u„] + T y [d",u„] + Tj (d“, u„) 
k m x [u x x T x , u n j + Uy x Ty |o y , u n j + u z x T z |d , u n | j + ®n 

e n = d,),e n + d^fi -- update temporary for Runge-Kutta scheme 

- Do tensor-product diagonalization for inversion of viscous operator 

- in Crank-Nicholson 

f 2 = T x (R\Ty(R>',T 2 (R*,f 1 ))) 
f 3 = C m xf 2 

gn = T x [Q\T y (Q>',T 2 ((y,f 1 ])) 
end (for n) -- end components loop 

u n = g n , n = x to z — update velocities for next step of Runge-Kutta 
end (for m) - end R-K steps loop 

- Compute pressure; generate RHS of 4a 

g, = k 4 [t x [d\u x ] + T y [DX.Uy] + T 2 [d*,u 2 ] ] 

- Invert Laplacian by tensor-product diagonalization 

g 2 = T x (R-,Ty[R>,T 2 (R',g,]]) 
g 3 = C 4 xg 2 

P = T x (o\T y (QM 2 ((y,g3))) 

-- Correct Velocities using eq. 4b 

u n = u n + k 5 T n [d" ,p] , n=x to z 

end (time step) 

Figure 1. The Algorithm for one time step 
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directions. These operators are used in the appropriate tensor-product form to compute 
derivatives of the dependent variables as needed. The characteristic matrices used to diago- 
nalize the implicit operator are denoted as Q n and R n (n= x to z), and are also used in the 
appropriate tensor-product form. In addition, the following quantities are used: 

• k m , m= 1 to 5, are algorithm and case-specific constants. 

• djJ, and d 2 , m= 1 to 3 are constants from the Runge-Kutta algorithm. 

• C m , m= 1 to 4 is an n x n x n field of constants used in the tensor-product diagonali- 
zation process. 

• e n , f n , and g n are temporary field arrays, used for clarity of presentation. 


The algorithm for one time step is given in Figure 1. 

COMPLEXITY AND TIMINGS 

By using N 2 processors, the overall complexity reduces to 0(N 2 ) instead of 0(N 4 ). 
Hence, the parallel solution time should clearly be superior to serial solutions. However, one 
tradeoff with having a large number of processors is that they are not individually as fast as 
one might like. There is also communications overhead. Timing results are made comparing 
a CM2 and similar code on a Cray-2. For smaller problems, such as a 32 3 problem, the Cray 
is significantly faster than the CM. However, as the problem size is increased, the factor of 
0(N 2 ) dominates, and the CM becomes more competitive. 

The table below shows timing results for the two codes. The Cray achieves better tim- 
ings, but this result is a somewhat unfair comparison. The Cray results were achieved on 1 
processor of a Cray2 using the local memory micro-coded matrix operations provided by the 
SARL library which achieve peek performance. The CM code is written in *Lisp, a higher 
level language, and is not particularly well optimized. Nevertheless, the CM does achieve a 
rate of about 60 Megaflops (counting the multiply and add as 2 instructions). 


* 
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Cray2 (1 proc) vs CM2 (16K proc) 

solving a (128 x 128 x 101) problem 

operations 

Cray2 

CM2 

direct multiply 

0.068 

0.015 

matrix op T 

1.41 

7.89 

matrix op T y 

y 

1.55 

7.90 

matrix op T 7 

1.55 

5.70 

overall 

216. 

857. 


Table 1 -- Timing Results in Seconds. 


It seems obvious that for even larger problems, the CM algorithm will be much faster than the 
Cray. However, using this method, the 128x128x101 problem is the largest practical prob- 
lem on a 16K CM2 because of the size of local memories. Using a 65K processor CM2, we 
estimate that the discretization of a (256x256x101) problem could be solved several times fas- 
ter than a Cray-2. 

CONCLUSIONS 

The results obtained from implementing the Navier-Stokes equations were quite 
encouraging. The Connection machine was slower but still competitive with respect to time 
with a Cray, while being significantly more cost effective. More spectacular results can be 
expected with larger versions of the Connection Machine. 

There is one final interesting point concerning this work: We have shown the efficient 
implementation of an implicit time-stepping algorithm utilizing a global discretization scheme 
on a SIMD parallel-processing machine. This is contrary to the "popular wisdom" which 
holds that SIMD machines are useful only when nearest-neighbor operations, such as those of 
an explicit finite-difference scheme, are carried out. The potential exists for this implementa- 
tion to provide numerical simulations of transition to turbulence in channel flow at resolutions 
impractical on present supercomputers. 
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